updated: 2022-05-03_12:33:00-04:00

Data Analysis & Visualization

Capturing your audience and showing them what they want to see in one chart

Public Data https://archive.ics.uci.edu/ml/index.php

Information Visualization

appropriate visualization is important...

What is Data Science?
Using data to answer questions
Somebody who combines the skills of software programmer, statistician, and storyteller slash artist

Big Data

Volume, Velocity, Variety: Qualities of Big Data

Volume: Large datasets
Velocity: Data generated and collected very quickly
Variety: Different types of data available

What is Data? A set of values of qualitative or quantitative variables.

EMR: Electronic Medical Records (messy data)

Questions come before data

Ask a question before you start looking at data. Look for data that is relevant

R

getwd() # Pwd  
myfunction <- function(){  
	x <- rnorm(100) # 100 random numbers  
	mean(x) # return the mean  
}  
  
source("mean.r") # source file, replaces previous source file  
x # print x  
print(x) # print x  
  
msg <- "hello"  
  
x <- 1:30 # x now holds 1 2 3 4 ... 30  
  
x <- 20 # x is Numeric  
x <- 20L # x is integer  
  
typeof(x) # returns type, class seems to do the same thing  
  
x <- NaN # not a number  
  
x <- c('a', 'b', 'c') # concatination, replace x with a b c  
  
x <- 5 # x is a double  
x <- as.integer(x) # x is now an integer  
  
m <- matrix(nrow=2, ncol=3) # matrix initialized with NA  
dim(m) # print rows,columns  
  
attributes(m) # also prints rows and columns?  
  
n <- matrix( 1:25 ,5 ,5 ) # fills columns then rows  
# Output:  
#      [,1] [,2] [,3] [,4] [,5]  
# [1,]    1    6   11   16   21  
# [2,]    2    7   12   17   22  
# [3,]    3    8   13   18   23  
# [4,]    4    9   14   19   24  
# [5,]    5   10   15   20   25  
  
m <- 1:10  
# [1] 1 2 3 4 5 6 7 8 9 10  
dim(m) <- c(2,5) # change dimension of m  
#      [,1] [,2] [,3] [,4] [,5]  
# [1,]    1    3    5    7    9  
# [2,]    2    4    6    8   10  
  
x <- 1:3  
y <- 10:12  
cbind(x,y) # Column bind  
#      x  y  
# [1,] 1 10  
# [2,] 2 11  
# [3,] 3 12  
rbind(x,y) # Row bind  
#   [,1] [,2] [,3]  
# x    1    2    3  
# y   10   11   12  
  
x <- list(1, "a", TRUE) # Stores values as vectors of vectors  
# <span class="missing-link" style="color:darkred;">1</span>  
# [1] 1  
#  
# <span class="missing-link" style="color:darkred;">2</span>  
# [1] "a"  
#  
# <span class="missing-link" style="color:darkred;">3</span>  
# [1] TRUE  
  
x <- factor(c("yes","yes","no","yes","no")) # Levels will become labels?  
# [1] yes yes no  yes no  
# Levels: no yes  
  
class(x) # Check the class  
# "factor"  
  
unclass(x) # basically categorize  
# [1] 2 2 1 2 1  
# attr(,"levels")  
# [1] "no"  "yes"  
  
factor (c(...), levels = c("yes","no")) # manually defining levels  
  
is.na(x) # is value empty?  
is.nan(x) # is it NaN  
  
x <- data.frame (foo=1:4, bar=c(T,T,F,T)) # Matrix with labels  
#   foo   bar  
# 1   1  TRUE  
# 2   2  TRUE  
# 3   3 FALSE  
# 4   4  TRUE  
  
data <- read.csv("input.csv") # data from csv into data.frame  
  
sal <- max(data$salary) # calculate the max value from the data  dataframe from column salary  
  
help("read.csv") # pull up a help document  
  
x <- 1:3  
names(x) <- c("foo","bar","xyz") # name columns of x  
  
list(a=1,b=2) #name them right off the bet  
  
dimnames(x) <- list(c("a","b"), c("c","d")) # name rows then columns  
  
dput(x,file = "y.R") # text form of R object  
y <- deget("y.R") # R object from text  
  
x <- "foo"  
y <- data.frame(a=1, b="a")  
dump(c("x","y"),file = "data.R")  
rm(x,y)  
source("data.R")  
  
con <- file("input.txt","r") # read  
data <- read.csv(con)  
  
con <- url("https://swirlstats.com/students.html","r")  
x <- readLines(con,10)

<- is assignment operator
one source file per
Everything is an object (by default vector)
factors are usually for categorical data

Types of Data

char (string)
numeric AKA Double (real numbers, double precision)
integer
complex
logical
vector (only one kind of data)
list (array), can have more than one kind of data

Data Attributes

name
dimnames
dimensions
class
length
userdefined

Notation

double square brackets [[]] always returns a list
Single brackets, you can put a range in
You can also put in a condition ie:

x <- c("a","b","c","a","e","a")  
x[x>"a"]  
# [1] "b" "c" "e"  
u <- x >"a"  
# [1] FALSE TRUE TRUE FALSE TRUE FALSE  
x[u]  
# [1] "b" "c" "e"  
  
complete.cases(x,y) # truth table

I think a list is a collection of vectors
x _ y vs x % _ % y

Scope:

ls(environment(cube))
like scheme: functional programming
if it cannot find a variable locally, it will look globally etc higher and higher up

date/time

Data Preprocessing

This comes before data analysis

Raw -> Preprocessing -> clean data -> data analysis -> presentation

Data: A set of values of qualitative or quantitative variables, belonging to a set of items

Need to document the process for processing

Raw: No manipulation, at all

Make sure there's only one measured variable per column
one observation of said variable per row
one variable per table
one table per file

Project

Choose a dataset
Come up with hypothesis/questions
13th

Report of project:

Where is data coming from? (public)
1 paragraph what is it about?
5+ questions you want to answer with data.

Going to be a class presentation with an R file (well documented)

Regression Algorithms

Machine Learning regression algorith:
- Trying to draw conclusions from data
midterm: for every activity, for each

Clustering

Sentiment analysis

Close? -> distance
group
visualize groups
interpret

How do we measure distance?
Euclidian: traditional
Manhattan

Hierarchical Clustering

automatic, dendogram

K Means Clustering

partitioning approach
- centroids of each cluster

Principal Compound Analysis (Singular value decomposition)

Matrix can only hold one kind of data, Frame can store different types
demensionality reduction
X is a matrix
- when you apply the SVD function you get 3 vectors
- U: left singular
- D: diagonal
- V$^T$: (t is transpose, rows to columns...) right singular

iiD

independent identical distribution
P(A$\cap$B)= P(A) * P(B)
Don't multiply probabilities if not an iid

PDF

Probability Density Function
Area under PDFs corresponds to the probabilities for that random variable

To be a valid PDF:

It must be larger than or equal to zero everywhere
The total area under it must be 1

Area under a chunk of curve is the probability that something is in that chunk

Eg:
The proportion of help calls addressed in a random day:
f(x) = {2x for 0 < x < 1; 0 otherwise}
is this a valid PDF? (yes, area is 1)

Cumulative Distribution Function

A CDF of a random variable C, returns the probability that a random variable X is less than or equal to a value x
F(x)= P(X<=x)

Survival Function

A SF of a random variable C, returns the probability that a random variable X is greater than a value x

S(x)= P(X>x)
S(x) = 1 - F(x)

CDF
F(x)= P(X<=x)
= 1/2 _ base _ height
= 1/2 _ x _ 2x
= x^2

So probability of 40% or fewer calls answered in a single day is answered by CDF
x = 0.4
CDF = 0.4^2 = 0.16, or 16%

probably of one given roll is odd on a 6 sided die is P(A/B) = P(A$\Cap$B)/P(B)= 1/3

Regular

P(A|B)= P(A$\cap$B)/P(B)

Baye's Rule

P(B/A)

Probability of b given a (reverse)
$\frac{P(A|B)P(B)}{P(A|B)P(B) + P(A|B^{c})P(B^{c})}$

- and - are the events that are the result of a diagnostic test (positive and negative)
D and D$^c$ are the events that the subject of the test has or does not have the disease respectively
Sensitivity is the probability of getting a positive result given that the subject has the disease
P(+|D)
Specificity is the probability of getting a negative result given that the subject does not have the disease
P(-|D$^c$)
Positive predictive value is the probability of the subject having the disease given that the test is true
P(D | +)
Negative predictive value is the probability of the subject not having the disease given that the test is false
P(D$^c$ | -)
Prevalence of the disease is the marginal probability of the disease
P(D)

P(D|+), P(+|D), P(-|D$^c$)

if a subject from a population with 0.1010 prevalence receives a test, what is the positive predictive value?
P(D) = 0.01
P(D|+) = $\frac{P(+|D)P(D)}{P(+|D)P(D)+ P(+|D^c)P(D^c)}$

...

6.2% that the subject has the disease

Independence

Two events A and B are independent if P(A$\cap$B) = P(A)P(B)
equivalency if P(A|B)=P(A)

Two random variables, x and y are independent for two sets A and B
P([x $\subset$ A]$\cap$[y $\subset$ B]) = P(x$\subset$A)P(y$\subset$B)

if A is independent of B the probability that the test is true
A$^c$ is independent of B
A is independent of B$^c$
A$^c$ is independent of B$^c$

Expected Values

Looking at the characteristics of expected values
- How much of entire population does this sample represent?

The Process of making conclusions about a general population from sample data (from the general population)

Making conclusions: characteristics of the distribution

Mean is characterized as the center
Variance and SD are characterization of how spread out the distribution is

Population Mean

x -> random variable
PMF (probability mean function) -> P(x)
E[x] = SUM xP(x)

So for a fair coin:

x	P(x)
0	0.5
1	0.5
E(x)= 0 * 0.5 + 1 * 0.4 = 0.5

Sample Mean

$\bar{x}$ = $\sum\limits_{x=1}^{n}x_{i}P(x_{i})$
if we have n points that are equally likely:
- $P(x_{i})$ = $\frac{1}{n}$
- eg: a die

Biased

x	P(x)
0	P
1	1-P
E(x) = 0 * P + 1 * (1-P) = P

Expected value

Exactly the center of mass for the density function

Expected values are properties of distributions
The population mean is the center of mass of population
The sample mean is the center of mass of the sample/observed data
The sample mean is an estimate of the population mean
The Sample mean is unbiased
The more data that goes into the sample mean, the more concentrated its density/mass function

Density: continuous
Mass: discrete

Variance

How spread out or clustered data points are
X is a random variable with mean $\mu$
- variance of x = E[(x-$\mu$)$^2$]= E[$x^2$]-E[x]$^2$

x^2	x	P
4	2	0.1
9	3	0.2
16	4	0.3
25	5	0.4

E[x^2] = 4 * 0.1 + 9 * 0.2 + 16 * 0.3 + 25 * 0.4
E[x] = 2 * 0.1 + 3 * 0.2 + 4 * 0.3 + 5 * 0.4

Variance = E[x^2]- E(x)$^2$ = 1

Variance of d6
E(x) =3.5

x^2	x	P
1	1	1/6
4	2	1/6
9	3	1/6
16	4	1/6
25	5	1/6
36	6	1/6
E(x^2)= 1 * 1/6 + 4 * 1/6 ...36 * 1/6
E(x^2)-E(x)^2

should equal something like 2.92

Unbiased coin:
E(x) = P
E(x^2)=P
Variance should be P-P^2

P(1-P)

Sample Variance

If Population Mean is the center of mass of the population, the Sample Mean is the center of mass for observed data

The Population Variance is the expected squared distance of a random variable.

$\sigma^2$ is population variance

The Sample Variance is the average squared distance of the observed observations minus the sample mean

$s^2$ = $\sum_{i=1}(x_i-\bar{x})^2/n-1$

Sample variance is a function of data
- it is also a random variable
- it is also a population distribution
- that distribution has an expected value
- that expected value is the population variance
- that is what the sample variance is trying to estimate
As you collect more and more data, the sample variance gets more concentrated around the population variance

STD (sqrt variance)

Sample variance s^2 estimates the population variance $\sigma^2$
Distribution of sample variance is centered around $\sigma^2$
The variance of sample mean s^2/n
standard error is s/sqrt(n)
s is STD

But what is the standard error?
It talks about how variable averages of random samples of size n are from the population

If variance is 1, STD = sqrt(1) = 1

Poisson

s^2 = variance = 4
means of random samples of n
s/sqrt(n) = 2/sqrt(n)

Variance of faircoin:
s^2 = 0.25
means of random samples of n = s/sqrt n = 1/2sqrt(n)

variance is squared, so if we are looking for a range it would be the mean +- the STD

Distributions

Bernoulli Distribution

Binary Outcomes

Bernoulli random variables takes the values 1 and 0
with probability p and 1-p

Probability Mean Function (PMF) for a Bernoulli Random Variable:

$P(X=x)=p^x(1-p)^{1-x}$

mean: P
variance: P(1-p)

Binomial Trials

The binomial random variables are obtained as the sum of iid (independent and identically distributed) Bernoulli trials
let $x_{1}...x_{n}$ be iid Bernoulli

then $X=\sum\limits_{i=1}^{n}x_{i}$
where X is the binomial random variable

the binomial mass $P(X=x)=(^{n}_{x})p^{x}(1-p)^{n-x}$

$(^{n}_{x})=\frac{n!}{x!(n-x)!}$

Suppose a friend gets 7H from 8 flips of a fair coin...
If each outcome has an independent 50% prob.
what is the prob. of getting 7H or more in 8 flips?

$(^{8}{7})\cdot (0.5)^{7}(1-0.5)^{1}+(^{8}{8})\cdot (0.5)^{8}(1-0.5)^{0}\approx 0.04$

R code:

choose(8,7)* 0.5^n + choose(8,8)*0.5^8  
# or  
pbinom(6, size=8, prob=0.5, lower_tail=false)

Normal Distribution (Gaussian)

A random variable is said to follow normal or Gaussian distribution with mean $\mu$ and variance $\sigma^2$ is associated density function

bell curve = $(2\Pi\sigma^{2})^{-1/2}e^{-(x-\mu)^{2}/2\sigma^{2}}$

if x is a RV with this density, then E[x] = $\mu$
var(x) = $\sigma^{2}$

$X~N(\mu,\sigma^{2})$ when $\mu$ = 0 and r$^2$ = 1, then x is a standard normal random variable

Regression

Choosing a regression model

Least square method

x	y	xy	x^2
1	1.5	1.5	1
2	3.8	7.6	4
3	6.7	20.1	9
4	9.0	36	16
5	11.2	56	25
6	13.6	81.6	36
7	16	112	49
---	----	-----	---
28	61.8	314.8	140

slope is cor(x,y)* sd(y)/sd(x)

y = 2.41x - 0.83
x = 2 => y = 2.41 * 2 - 0.83 = 3.99
etc for x = 5,7

x <- c(1,2,3,4,5,6,7)  
y <- c(1.5, 3.8, 6.7 etc)  
cor(x,y) will give you a value for how they are correlated  
slope <- cor(x,y)*(sd(y)/sd(x))  
intercept <- mean(y) - (slope * mean (x))  
abline(lsfit(x,y),lwd=2, lty=2,col="blue")

You want to minimize the mean squared error (MSE)

For a particular line, we want to sum up the difference between the y value (actual for points we have) and the y value of the line we are testing

$\Sigma (y_{i}-\hat{y}_{i})^{2}/n$

Have to plot base plot first, then run manipulate on ggplot

Empirical mean
$\bar{x}= \frac{1}{n}\sum^{n}{i=1}x{i}$
if we subtract the mean from data poins we get data that has mean 0
$\tilde{x}{i}=x{i}-\bar{x}$
$\tilde{x}_{i}$ is a data point that has mean 0

We can get the least square solution by minimizing $\sum\limits^{n}{i=1}(x{i}-\mu)^2$

Empirical SD & Variance
$s^{2}=\frac{1}{n-1}\sum\limits^{n}{i=1}(x{i}-\bar{x})^2$
$s^{2}=\frac{1}{n-1}(\sum\limits^{n}{i=1}x{i}^2-n\bar{x}^2)$
$s=\sqrt{s^2}$ is standard deviation
$\frac{x_i}{s}$ => a data point that has standard deviation 1
Scaling the data

Regression (again)

Let's do home pricing:
y vs x (price vs. sqr. footage)

Training Data
Feature Extraction
ML Model
Quality Metric (maybe go back to manipulate model)
feed back to training data

Beta values are the coefficients of each power in a polynomial